Regular expressions

On this page I will try to explain the regular expression based search in Tiny Hexer.

Overview

You can choose to use regular expressions in Tiny Hexer's Find/Replace window to perform powerful searches. If you have ever used the wildcards "*" or "?" in a directory listing (like "dir *.txt"), you have already used a simple relative of regular expressions.

Regular expressions in Tiny Hexer, however, can be much more powerful than those filename wildcards.

Imagine the following: There's a binary file containing address records, but it's corrupted somehow and cannot be opened anymore in the address book application. Now you need the address of a Mr. Johnson, Jonson, Jenssen or is it Jensen? Unfortunately you are not sure about the exact name, and you cannot find his address anywhere else. But you know that the record is stored in this binary address record file. After loading this file into Tiny Hexer you could do sequential searches for all those names in the data, but what if he's called Jehnsen or Jonnsen? Using the regular expression J[oe]h?n+s+[oe]n you can search for all these names at once.
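
The name search above can also be tried outside Tiny Hexer. The following is a minimal sketch in Python, assuming that Python's re module treats this particular pattern the same way as Tiny Hexer's matcher does. The sample record bytes and all variable names are invented for illustration.

```python
import re

# Invented sample data: a few fake address records separated by zero bytes.
data = (b"\x00\x01Jensen, 12 Oak St"
        b"\x00\x02Miller, 3 Elm Rd"
        b"\x00\x03Jonnsen, 7 Pine Ave\x00")

# The pattern from the text, written as a bytes pattern so that it can be
# applied to binary data.
name_pattern = re.compile(rb"J[oe]h?n+s+[oe]n")

# Collect every spelling variant together with its offset in the data.
hits = [(m.start(), m.group()) for m in name_pattern.finditer(data)]
print(hits)

# Every spelling mentioned in the text is matched completely by the pattern.
names = [b"Johnson", b"Jonson", b"Jenssen", b"Jensen", b"Jehnsen", b"Jonnsen"]
all_match = all(name_pattern.fullmatch(n) is not None for n in names)
print(all_match)
```

A single pass over the data reports both "Jensen" and "Jonnsen", so one search replaces a separate search for each spelling.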

So how does it work? I'll try to answer this question in the following sections.

Definitions

Regular expressions are sequences of elements used to match patterns of data. Each element (character, set or group) in a regular expression can match zero, one or more characters or groups of characters, depending on branches and modifiers.

Characters

The simplest element is an ordinary character ("a", "b"...). This element matches exactly that character in the data. In the example above, the character J is used to find the first character of the name (Johnson, Jensen...). To allow searching for binary data, character elements can also be written in hexadecimal form by prefixing them with the \x escape (e.g. \x0A for the character with the binary value 0A hex = 10 dec). Characters that have a special meaning in regular expressions must be escaped with a backslash (\) unless they are written in hexadecimal notation. Those characters are ., *, \, +, (, ), ?, |, [ and ].
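
As a rough illustration of hexadecimal notation and escaping, here is a small Python sketch. Python's re module happens to use the same \x escape and the same backslash escaping for these characters, but it is not the engine used in Tiny Hexer. The sample data is made up.

```python
import re

# Made-up data containing two line feeds (0A hex) and a literal plus sign.
data = b"line one\x0Aprice: 1+1=2\x0A"

# \x0A in the pattern matches the byte with the value 0A hex (10 dec).
newline_offsets = [m.start() for m in re.finditer(rb"\x0A", data)]

# "+" is a modifier, so it must be escaped to match a literal plus sign.
literal_plus = re.search(rb"1\+1", data)

# Unescaped, "1+1" means "one or more 1s followed by a 1", which does not
# occur in the data.
unescaped = re.search(rb"1+1", data)

# Writing the plus sign in hexadecimal (2B hex) needs no escape.
hex_plus = re.search(rb"1\x2B1", data)

print(newline_offsets, literal_plus, unescaped, hex_plus)
```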

Sets

Sets are used to match any one of the characters included in the set. In the example above, the set [oe] is used to match either the character "o" (Jon.../...son) or the character "e" (Jen.../...sen). Sets are enclosed in brackets ([...]). You can include ranges of characters in sets by writing a hyphen between the first and the last value, e.g. to match alphanumerical characters, you may write [0-9a-zA-Z]. You can also negate a set by putting a circumflex (^) directly after the opening bracket: if you want to search for any data except alphanumerical characters, you may write [^0-9a-zA-Z].
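
Ranges and negated sets can be sketched in Python as follows, assuming that bracket sets behave the same way in Python's re module as in Tiny Hexer. The sample bytes are invented. The "+" modifier, explained in the Modifiers section, is used here only to collect whole runs of characters.

```python
import re

# Invented sample: readable fields mixed with two binary bytes.
data = b"ID=42;\x00\xffname=Bob_7"

# A set with ranges: runs of alphanumerical characters.
alnum_runs = re.findall(rb"[0-9a-zA-Z]+", data)

# A negated set: every single byte that is NOT alphanumerical.
non_alnum = re.findall(rb"[^0-9a-zA-Z]", data)

# A plain set without ranges: either "o" or "e", as in the name example.
o_or_e = re.findall(rb"[oe]", data)

print(alnum_runs)
print(non_alnum)
print(o_or_e)
```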

The following predefined sets exist (they do not need to be embedded in brackets):

"." matches any value (00 hex - FF hex)
"\a" matches any alphabetic character ('a' - 'z' and 'A' - 'Z')
"\d" matches any digit ('0' - '9')
"\w" matches any alphanumerical character ('a' - 'z', 'A' - 'Z' and '0' - '9')

Groups

Groups connect sequences of elements. They are used in conjunction with modifiers and branches (see below). Element groups are enclosed in parentheses ((...)). So if you want to search for either "John" or "Jane", use the regular expression (John|Jane). If you want to search for either "John" or "Jonny", you might use Joh?n(ny)?, which also matches "Jon" and "Johnny".
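
The two group examples can be checked with a short Python sketch. It assumes that groups behave alike in Python's re module and in Tiny Hexer. The sample sentence is invented.

```python
import re

# A group containing a branch: either "John" or "Jane".
john_or_jane = re.compile(r"(John|Jane)")

# An optional group: "(ny)?" makes the whole sequence "ny" optional,
# whereas "h?" makes only the single character "h" optional.
john_or_jonny = re.compile(r"Joh?n(ny)?")

text = "Jane met John, Jon and Jonny."
first = [m.group() for m in john_or_jane.finditer(text)]
second = [m.group() for m in john_or_jonny.finditer(text)]

print(first)
print(second)
```

Without the parentheses, the ? in Joh?nny? would apply only to the final "y", so the group is what makes "ny" optional as a unit.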

Branches

Branches let you search for any one of several alternative groups or elements. They work like a logical OR. The alternatives of a branch are separated by a pipe character (|). So if you want to search for either "John Smith", "John Taylor" or "John Q Public", you can use the regular expression John (Smith|Taylor|Q Public).
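
A minimal Python sketch of the branch example follows. The candidate names other than the three from the text are invented. The behaviour of the ungrouped variant is that of Python's re module and has not been verified against Tiny Hexer.

```python
import re

# The branch is limited to the surname by the surrounding group.
pattern = re.compile(r"John (Smith|Taylor|Q Public)")

candidates = ["John Smith", "John Taylor", "John Q Public",
              "John Doe", "Jane Smith"]
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)

# Without the group, Python reads the expression as three alternatives:
# "John Smith", "Taylor" and "Q Public".
ungrouped = re.compile(r"John Smith|Taylor|Q Public")
bare_taylor = ungrouped.fullmatch("Taylor") is not None
print(bare_taylor)
```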

Modifiers

Modifiers tell the matcher how often the preceding group, element or set must occur in the data.

The following modifiers exist:

"*" the preceding group, element or set may occur any number of times (including zero), e.g. Jon*y matches "Joy", "Jony", "Jonny" and "Jonnny".
"?" the preceding group, element or set may occur once or not at all, e.g. Jon?y matches "Joy" and "Jony", but not "Jonny" or "Jonnny".
"+" the preceding group, element or set must occur at least once, e.g. Jon+y matches "Jony", "Jonny" and "Jonnny", but not "Joy".

The modifier ? has a different meaning if it is the first character of the regular expression: it tells the matcher to be "non-greedy", that is, the modifiers "eat" as few characters as possible to match the pattern. An example: the pattern T.+s finds "This is" in the text "This is a text", whereas the pattern ?T.+s finds only "This".
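
The modifier examples above can be reproduced with the following Python sketch. One difference is an assumption to keep in mind: Python does not accept a leading ? in a pattern. Instead, it makes a single modifier non-greedy when a ? is appended to that modifier, so T.+?s stands in here for Tiny Hexer's ?T.+s. The helper name matching is invented for this example.

```python
import re

words = ["Joy", "Jony", "Jonny", "Jonnny"]

def matching(pattern):
    """Return the words that the pattern matches completely."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"Jon*y")      # "n" any number of times, including zero
question = matching(r"Jon?y")  # "n" once or not at all
plus = matching(r"Jon+y")      # "n" at least once
print(star, question, plus)

text = "This is a text"

# Greedy: ".+" takes as many characters as possible before the final "s".
greedy = re.search(r"T.+s", text).group()

# Non-greedy (Python spelling): ".+?" takes as few characters as possible.
lazy = re.search(r"T.+?s", text).group()
print(repr(greedy), repr(lazy))
```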

Notes

The search for regular expressions is based on the Regex Library by Niche Software.
I modified the original code to (hopefully) work with different character sets and Unicode files. After a little testing I decided to use my modified code in Tiny Hexer. Please inform me if searching for regular expressions does not work as expected. Do not blame the author of the original library for possible mistakes in my derivative implementation!
In contrast to the usual behaviour of the . set in regular expressions, which in most implementations does not match line breaks, the Tiny Hexer version matches all data values (00 hex - FF hex).
The common regular expression anchors ^ (match line start) and $ (match line end) are not implemented correctly in my modified version of the regular expression library and should therefore be avoided.
When searching in Unicode files, sets do not work if they contain characters with a Unicode value > 00ff hex, e.g. [€ó] will not match the Euro sign. In this case you should use grouped branches instead: (€|ó)
There might be further problems with matching regular expressions if the data file is in Unicode format or in a character set other than Windows ANSI. This is due to some quick 'n' dirty hacks I made in the RegEx library derivative used in Tiny Hexer.
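
The note above about the . set can be illustrated by comparison with another engine. In the following Python sketch, the default behaviour of . stops at a line feed, while the re.DOTALL flag makes it match every byte value, which is how this page describes . in Tiny Hexer. The sample bytes are invented, and Python is used here only for comparison.

```python
import re

# Invented sample: text, a line feed (0A hex), more text and two binary bytes.
data = b"AB\x0aCD\x00\xff"

# Default Python behaviour: "." does not match the line feed.
default_run = re.match(rb".+", data).group()

# With re.DOTALL, "." matches every byte value from 00 hex to FF hex.
dotall_run = re.match(rb".+", data, re.DOTALL).group()

print(default_run)
print(dotall_run)
```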